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Abstract 

Background: Many analyses of gene expression data involve hypothesis tests of an interaction term between two 
fixed effects, typically tested using a residual variance. In expression studies, the issue of variance heteroscedasticity 
has received much attention, and previous work has focused on either between-gene or within-gene 
heteroscedasticity. However, in a single experiment, heteroscedasticity may exist both within and between genes. 
Here we develop flexible shrinkage error estimators considering both between-gene and within-gene 
heteroscedasticity and use them to construct F-like test statistics for testing interactions, with cutoff values 
obtained by permutation. These permutation tests are complicated, and several permutation tests are investigated 
here. 

Results: Our proposed test statistics are compared with other existing shrinkage-type test statistics through 
extensive simulation studies and a real data example. The results show that the choice of permutation procedures 
has dramatically more influence on detection power than the choice of F or F-like test statistics. When both types 
of gene heteroscedasticity exist, our proposed test statistics can control preselected type-l errors and are more 
powerful. Raw data permutation is not valid in this setting. Whether unrestricted or restricted residual permutation 
should be used depends on the specific type of test statistic. 

Conclusions: The F-like test statistic that uses the proposed flexible shrinkage error estimator considering both 
types of gene heteroscedasticity and unrestricted residual permutation can provide a statistically valid and 
powerful test. Therefore, we recommended that it should always applied in the analysis of real gene expression 
data analysis to test an interaction term. 



Background 

The regulation of gene expression starts when a cell's 
DNA is transcribed into mRNA. The simultaneous 
expression profiles of many genes under different cir- 
cumstances can provide insight into physiological pro- 
cesses. Using modern technologies in gene expression 
experiments such as oligonucleotide arrays [1], and 
cDNA spotted arrays [2], many scientists have made 
novel discoveries about complex biological processes of 
yeast [3,4], drosophila [5], mice [6], humans [7], and 
other species. Recently one such study also included 
RNA-seq [8]. Statistical methodologies and issues 
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involved in microarray data analysis have been widely 
reviewed [9-12], and it is expected that many of the 
same issues will need to be addressed with RNA-seq. 

The analysis of variance (ANOVA) model is a popular 
statistical modeling method for the analysis of microar- 
rays. Since its introduction by Kerr et al. [13], it has 
been extensively examined for use in this setting 
[14-21]. Kerr et al. constructed an ANOVA model that 
included the gene effect as a fixed effect. This model 
assumes identically and independently distributed resi- 
dual errors across genes. The advantage of this model is 
that the large number of genes involved in a microarray 
experiment results in huge degrees of freedom for the 
error estimate, which can lead to a very powerful test. 
However, the common assumption of homoscedasticity 
may not hold true in this setting [22]. One alternative is 
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to use an ANOVA model for each gene, but the result- 
ing test statistics from gene-specific models may have 
limited power because the biological sample size for 
each gene in a microarray experiment is usually small 

To address this problem of limited power, researchers 
have proposed other methods for obtaining more infor- 
mation across genes, ranging from a simple equal- 
weighted average of a gene-specific error estimate and 
the global average of all gene-specific error estimates (F 2 
statistic proposed by Wu et al. [19] to empirical Baye- 
sian modeling of all gene-specific errors [23-26]. Other 
variations [27-29] used different variance modeling stra- 
tegies to address the heteroscedasticity problem, but no 
clear winner has emerged [30]. Huang and Liu [31] 
extended the test statistics proposed by Cui et al. [28] 
by assuming a normal distribution on the mean and 
then deriving an empirical Bayes likelihood ratio test. 
The resulting test statistic shrinks both the mean and 
variances. 

In addition to the problem of between-gene heterosce- 
dasticity, we must also be concerned with within-gene 
heteroscedasticity. For example, in the study of simple 
differential gene expression between a treatment group 
and a control group, the variance in the treatment arm 
may differ from that in the control arm. Some 
approaches to this problem include a general Bayesian 
framework to model heteroscedastic error in a single 
generalized linear mixed model setting [32] and a struc- 
tural model placed on the error variances specific to 
each gene and treatment combination [33]. 

As gene expression studies become more popular, the 
complexity of the experiment increases. Instead of only 
simple treatment and control experiments, two or more 
factor experiments are being conducted. This increase 
in experiment complexity has led to many scientific 
questions involving the hypothesis testing of an interac- 
tion between two factors. For example, testing a probe 
by genotype interaction can result in inferences about 
polymorphism in the probe, such as single nucleotide 
polymorphism (SNP) and insertion-deletion (indel) 
[34-37]; testing a probe by sex can imply that alternative 
splicing occurs between male and female subjects [38]; 
and in pharmacogenomic studies, testing the genotype- 
drug/treatment or genotype-disease interaction may be 
of interest [39]. Thus far, all the development of 
ANOVA methods for microarray studies has focused on 
tests of main effects. 

Here, a generalized shrinkage estimator incorporating 
both within- and between-gene heteroscedasticities is 
developed (see Lehmann and Cesella [40] for a review 
of shrinkage estimation). In any given experiment, both 
within-gene and between-gene heteroscedasticity may 
exist; thus, taking these possibilities into account should 



lead to an improved test statistic. Moreover, given the 
increasing complexity of recent studies and the bur- 
geoning interest in hypotheses that involve interactions, 
we focus on an improved shrinkage-based F-test for 
interaction terms. 

Methods 

Here we develop new shrinkage estimates for the error 
term and show how to use these estimates to construct 
F-like statistics. We then estimate the null distribution 
of these statistics by using permutation tests. 

Shrinkage error estimators 

Shrinkage error estimators pull individual error esti- 
mates toward shrinkage targets, with the amount of 
shrinkage depending on the variability of individual 
error estimates [28,40]. Let the gene-specific error esti- 
mates for all genes i and subgroups k be 

i = k = 1,..., K, and let cr 2 k be the true variance of 

gene i in group k. When the experimental design is 
balanced, <r? fe is the residual mean square for gene i in 
group k and va^/cr^ ~ Xv> where v represents the 
degrees of freedom for the error estimates. 

The choices of shrinkage targets in microarray data 
include the following: 

1. Specific values for each gene-group combination 

2. Gene-specific values that are the same across all 
other groups 

3. Group-specific values that are the same across 
genes but different across groups 

4. A single point representing the underlying com- 
mon error 

Correspondingly, these targets are correct when (1) 
there are both within-gene and between-gene heterosce- 
dasticity; (2) there is only between-gene heteroscedasti- 
city; (3) there is only within-gene heteroscedasticity; and 
(4) all error variances are identical. We now develop a 
generalized shrinkage error estimator using these four 
shrinkage targets. 

Let X i>k = log (J^ - m ~ log or? fe + logx^/v ~ m > where 
m is the mean of log logXy/ y - Then using asymptotic 
normal approximation of X i)k , the distribution of X i)k s 
with different shrinkage targets for different gene i and 
group k combinations is 

Xtt|0tt~N(0 a ,<r 2 ) 

Oi,k ~ N (/x + a,- + p k , t 2 ) , 
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where 0 = (<9 U , 0 ltKr .., 6 llt 0 IiK ) , a = (a lf a 7 ) 
represents the gene-specific mean differences, and 
P = (Pi,..., Pk) models different means with respect to 
different classes of the subgroups. 

If a 2 and r 2 are known, then the Bayes estimator of Q it 
k under the squared error loss is [39]: 



9 b 



(/x + ai + p k ) + — r~ 



?Gen-ce,i,k = ex P( m + £) 
[IK-3]a 2 



*exp[[ 1 



E(X a -/x) 2 



&*-£)]. 



(5) 



The shrinkage error estimator proposed by Cui et al. 
[28] shrinks the gene-specific error estimators toward 
their common corrected geometric mean. Specifically, 
the estimator for a 2 is calculated as 



Here, a 2 is the variance of log Xvl v an< ^ * s l< nown 
[28,40], but r 2 is not known. However, the marginal dis- 
tribution of X i)k can be used to create an empirical 
Bayes estimator of r 2 and hence of 6 i)k . Marginally, X i}k 
~ N(fd + OCi + /3/ 0 c 2 + r 2 ),/ = 1,..., I, k = 1, ...K, and, from 
this model, the least square estimates of fx, a, p, jl, a, p, 
are the uniformly minimum-variance and unbiased esti- 
mators. Using the fact that 

^ [IK-{I + K-l)-2] ^ 1 

E (— : — : = ^TTTi' 



the empirical Bayes estimator for r 2 is 
E(X a -jl-ai- p k ) 2 /[IK - (I + K - 1) - 2] - a 2 

Then, we can construct the positive-part empirical 
Bayes estimator [40]: 



2 EB+ 



Xvk + i - 



X i/k = fr + a>i + fa, 



[IK-(I + K- 1) -2]a : 



where(#) + = rnax(x, 0). The generalized shrinkage 
error estimate for o i>k can be obtained through exponen- 
tiating as follows: 



(2) 



Using a similar argument, the generalized shrinkage 
error estimator with the shrinkage target at each gene 
is 



*Gm-geneA,k = + & + & i) 

[IK-{I-l)-2]a 2 



*exp[l 1 



E(X X/fe - jl - OLif 



(X iik - fl - Sii)], 



(3) 



with the shrinkage target at each group is 

a Gen-grp,i,k = eX P( m + A + Pk) 

I [IK — (K — 1) — 2]cr 2 \ / . .(f) 

*exp[(l-i * / - \ [X ik -jl-p k )i 

\ s(Xi /fe - jx - Pk) / + 

and with the shrinkage target at the common error, 
we have 



Cui,i 



* exp[ 



exp I m + 



EX 



[7-3]^ 



V 



(6) 



where X, is the residual variance estimate from a 
gene-specific model, and m and a 2 are the mean and 

y 2 Kv 

variance of log _ The underlying assumption for 

Kv 

this estimator is that there is no between-gene hetero- 
scedasticity, as this estimator shrinks every gene-specific 
error estimator toward one target. Therefore, it will 
overshrink the gene-specific error estimates when gene 
heteroscedasticity exists. In comparison, generalized 
shrinkage error estimators are flexible in terms of incor- 
porating a different type of heteroscedasticity. Some 
degrees of freedom are used for incorporating the het- 
eroscedasticity. However, the gain is that the error esti- 
mator is then closer to the underlying distribution and 
should lead to better performance of the resultant F-like 
test statistics as shown in the results section. 

In formulas (2), (3), (5), and (6), m is the mean and a 2 
is the variance of a log-transformed chi-square random 
variable. The simulation-based approximate values of m 
and a 2 can be found from Table 1 in work of Cui et al. 
[28]. Pounds [41] gave analytical expressions for these 
parameters and developed R code for the exact calcula- 
tion. Here, the simulation-based approximate values 
were used. 

Shrinkage F-like statistics 

To construct a statistic for the hypothesis test of no 
interaction between two fixed effects, the traditional F- 
test is simply the ratio of the mean square of the inter- 
action term (MSI) and the mean square of residuals 
(MSE). This F-test, referred to as F x [42], is 

MSI MSI _ ^ 
Fx = = . The F 1 test corresponding to a speci- 

MSE o 2 
fic gene i is denoted by 



MSU 



(7) 
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Table 1 Results from raw data permutation 



Restricted? 


Data set 




F 2 


F 3 




F Gen 


^Gen-gene 


F ' Gen-grp 


YES 


null-ce 


5.05(0.07) 


5.06(0.08) 


5.12(0.17) 


5.09(0.10) 


5.09(0.10) 


5.05(0.08) 


5.11(0.10) 




null-gh 


5.02(0.07) 


5.13(0.16) 


5.26(0.20) 


5.03(0.07) 


5.07(0.12) 


5.03(0.07) 


5.11(0.16) 




null-wgh 


4.97(0.07) 


4.96(0.09) 


4.93(0.18) 


4.99(0.08) 


4.99(0.12) 


4.96(0.09) 


5.01(0.16) 




null-bgh 


5.02(0.07) 


4.99(0.17) 


5.03(0.21) 


5.02(0.07) 


5.02(0.15) 


5.01(0.09) 


5.03(0.18) 


NO 


null-ce 


5.10(0.07) 


5.06(0.08) 


5.06(0.08) 


7.4(0.12) 


5.15(0.09) 


5.12(0.08) 


5.08(0.08) 




null-gh 


5.08(0.07) 


5.12(0.16) 


5.12(0.12) 


7.4(0.09) 


5.10(0.11) 


5.07(0.10) 


5.12(0.09) 




null-wgh 


12.31(0.10) 


7.56(0.10) 


4.61(0.10) 


17.37(0.14) 


5.32(0.11) 


5.07(0.09) 


5.87(0.11) 




null-bgh 


12.31(0.11) 


6.63(0.17) 


5.55(0.19) 


15.68(0.12) 


6.30(0.12) 


6.10(0.11) 


6.30(0.11) 



CWER obtained from 1,000 permutations with the nominal significance level setting at 0.05, with standard errors in parentheses. Nine hundred simulation runs 
were performed to get empirical average CWER of all types of F-like test statistics. 



The error variance estimator in this test uses data 
from only gene i. In oligonucleotide mi-croarray models, 
the degrees of freedom for the error estimate can be 
small because the sample size of RNA is usually small, 
and hence the power of Fi can be limited. 

Following the method of constructing an F-test statis- 
tic given by Neter et al. [42], the gene-specific shrinkage 
F-like statistics for testing an interaction between two 
fixed effects can be obtained as 



FGen,i 
FGen— gene, i 
FGen—grp,i 
FGen—ce,i 
Fcui,i 



^Gen,J K ' 



MSh 



^ k ®Gen —gene, i,h 

MSh 



^ d Gen-grp,i,kl K ' 

MSU 

^ a Gen-ce,J K ' 



MSU 



Cui,i 



When the homoscedastic error assumption is true, the 
pooled variance estimator, or^ ooP can be used to construct 
an F-like statistic. For a balanced design, the pooled var- 
iance estimate is the average of all gene-specific error 
estimates. This statistic is denoted by F 3 using the same 
notation used by Cui and Churchill [22], who also intro- 
duced another shrinkage-type F statistic, F 2 , which can 
also borrow information across genes when estimating 
the residual variances. The statistic F 2 uses an equal- 
weighted average of a gene-specific error estimator £ 2 
and Vpooi- The definitions of F 2 ,i and F 3)i are 



MSU 



2,i 



0.5^ + 0.50^' 



MSU 

* 2 
a pool 



Permutation tests 

For the proposed generalized shrinkage F-like test statis- 
tics, the null distributions are not known named distri- 
butions. Therefore, an empirical approach such as a 
permutation test can be used to estimate the null distri- 
butions. The permutation test for interaction is compli- 
cated, because there is no exact permutation test for 
such a purpose [43]. We therefore must consider an 
approximate permutation method for testing an interac- 
tion term in a crossed fixed/mixed model [44,45]. 

Permutation approaches developed previously focused 
on a single ANOVA model. In the typical gene expres- 
sion study, thousands of ANOVA models are considered 
simultaneously. The additional complexity of the shrink- 
age F-like statistics indicates that Monte Carlo studies 
are needed to investigate the performance of residual 
permutation and raw data permutation, with restrictions 
or not, in a gene-expression analysis. The choice of per- 
mutation procedures is critical for assessing the perfor- 
mance of a test statistic. 

For all the modified F-like statistics presented in the 
previous section, the null distributions can only be 
approximated empirically, but permutation procedures 
can be used to find the approximate null distribution of 
all the F and F-like statistics. The important issues in 
performing a permutation analysis include the choice of 
the exchangeable units under the null hypothesis, the 
choice of using restricted permutation or not, and the 
choice of residual permutation or raw data permutation. 
These choices influence the power of a test statistic. 

Residual permutation using residuals from a reduced 
model and unrestricted raw data permutation can be 
used to approximate the null distribution of a statistic 
for testing an interaction term [44]. When using F x to 
test an interaction term in a single ANOVA model, the 
residual permutation leads to a more powerful test than 
unrestricted raw data permutation [44]. However, in 
gene expression analysis, thousands of gene-specific 
ANOVA models are simultaneously considered, and for 
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a particular gene-specific ANOVA model, information 
from other gene-specific ANOVA models is used to 
construct the shrinkage error estimate. Hence, both resi- 
dual permutation and raw data permutation were inves- 
tigated. Furthermore, both restricted and unrestricted 
permutations were studied, because the permutation 
units are exchangeable only within each particular 
group when within-gene heteroscedasticity is present 
across those subgroups. 

Results 

The properties of this shrinkage estimator are compared 
with those of other existing F and F-like statistics that 
have been proposed and described in the "Shrinkage F- 
like statistics" section. 

Simulation studies 

The purpose of these simulation studies was to compare 
the performances of F l9 F 2 , F 3 , F Cub F Gen , F Gen , gene , and 
^Gen-grp in terms of type I error and power and to com- 
pare the results of a particular F-like statistic using four 
different permutation strategies: restricted/unrestricted 
residual permutation and restricted/unrestricted raw 
data permutation. 

In these simulation studies, 100 genes with two probes 
for each gene and three replicates from each of two 
lines were simulated to mimic a split-plot design in a 
general oligonu-cleotide microarray experiment. The 
gene-specific ANOVA model in which data were gener- 
ated from the model, y ptr = P p + L t + RL rt + PL pt + e ptr , 
wp = 1, 2, / = 1,2, r = 1,2,3, where P, L, RL, and PL 
represent probe, line, replicates from a particular line, 
and the interaction between probe and line, respectively. 

Replicates were nested within each line, and RL is 
usually treated as a random effect during the model-fit- 
ting procedure, which results in a correlation between 
probes from the same biological sample. In the simu- 
lated data sets, the correlation between genes was 0. As 
many as 900 simulation runs were carried out to com- 
pare the performances of F l9 F 2 , F 3 , F Cub F Gen , F Gen . gene , 
and F Gen _ grp based on different permutation procedures. 
The four permutations tested were unrestricted residual 
permutation, restricted residual permutation with 
respect to each line, unrestricted raw data permutation, 
and restricted raw data permutation with respect to 
each line. The residuals permuted were from a reduced 
fixed model with fixed effects for only line and probe. 

Two types of data were simulated: null cases and cases 
with a probe by line interaction at a range of degrees. 
Null cases included: null-ce, all probe-level expression 
values were simulated from the standard normal distri- 
bution; null-gh, the gene-specific error variances were 
simulated from the log-normal distribution with mean 
log at 0 and standard deviation at 2, mimicking the 



general heteroscedastic error distribution in typical data- 
sets; null-wgh, all genes had the same error structures 
and the residual error variance of line 1 was 100 times 
that of line 2; null-bgh, simulated data were modified 
from null-gh, with the variance of line 1 multiplied by 
100. Correspondingly, ce, gh, wgh, and bgh in Figures 1 
and 2 were simulated by adding interaction terms to 
null-ce, null-gh, null-wgh, and null-bgh. Quantitative 
interaction was assumed and the differences in the 
opposite direction were set to make the detection 
powers for an interaction term based on traditional F- 
statistics and tabled p-values range from 0.05 to 0.95. 

Tables 1 and 2 show the results from 900 simulation 
runs using raw data permutation and residual permuta- 
tion, respectively. Data in Table 1 suggest that when 
both types of gene heteroscedasticity exist, the unrest- 
ricted raw data permutation had a greater average com- 
parison-wise error rate (CWER) than residual 
permutation. Raw data permutation with restriction can 
control prespecified CWER im all cases. In Table 2, for 
the common error cases, all test statistics had the pre- 
specified CWER from both restricted and unrestricted 
residual permutation. When within-gene heteroscedasti- 
city existed, F 1 and F Cu i had inflated CWER from both 
two residual permutation tests. Restricted residual per- 
mutation reduces, but does not solve, this problem. For 
F 2 and F 3 , only the restricted residual permutation could 
control the prespecified CWER. For F Gen , F Gen _ gene , and 
^Gen-grpy restricted residual permutation gave conserva- 
tive results in terms of having CWER smaller than the 
prespecified level. When the shrinkage target is correctly 
set, unrestricted residual permutation controls the nom- 
inal CWER. As expected, only F Gen coupled with unrest- 
ricted residual permutation could be used for all cases, 
because the CWER was always less than the nominal 
level. 

Further simulations to compare the rejection rates 
were conducted. Only results from residual permutation 
are shown because it was found that raw data permuta- 
tion was less powerful than residual permutation. This 
is consistent with the findings of Anderson and Ter 
Braak [44]. Figure 1 shows the estimated average null 
hypothesis rejection rate curves from all F-like statistics 
and both restricted and unrestricted residual permuta- 
tion procedures. The x-axis represents the average null 
hypothesis rejection rate using F x and the tabulated p- 
values. The solid line shows that the corresponding sta- 
tistic controls the prespecified CWER, and the dashed 
line shows that the corresponding CWER was inflated. 
In general, restricted residual permutation is less power- 
ful than unrestricted residual permutation. For example, 
the power of all statistics from unrestricted residual per- 
mutation almost doubled in some cases where hetero- 
scedasticity existed. 
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Figure 1 The comparison of power curves of all F-like test statistics. The x-axis is the average power from analyzing 900 simulated data 
sets using Fi with tabled p-values. The y-axis is the estimated powers using empirical gene-specific null distributions from 1,000 residual 
permutations. The upper four plots show the results with restricted residual permutation, while the lower four plots show the results from 
unrestricted residual permutation. The solid line indicates the empirical average CWER of a statistic is at the prespecified level, and the dashed 
line shows an inflated empirical average CWER."ce," all genes have common error; "gh," only between-gene heteroscedasticity exists; "wgh," only 
within-gene heteroscedasticity exists; "bgh," both between-gene and within-gene heteroscedasticity exist. 



When the common error assumption is valid, F 3 is 
obviously the most powerful test and the prespecified 
CWER is controlled. All other F-like statistics performed 
very similarly in this case. When the shrinkage target 
was correctly set, the resultant test statistic was the 
most powerful one. For example, when there was only 
within-gene heteroscedasticity, F Gen _ grp was more power- 
ful than F Gen and F Gen _ gene based on either restricted or 
unrestricted residual permutation. The rejection rate 
comparison of statistically valid test statistics is further 
illustrated in Figure 2, where the x-axis is the average 
rejection rate from using F Gen and unrestricted residual 
permutation. Figure 2 clearly shows that unrestricted 
residual permutation is more favorable in terms of 
power. F Gen _ grp appears to be more powerful than F Gew 



but when both types of gene heteroscedasticities occur, 
^Gen grp nas inflated CWER. 

Drosophila data 

The data used in this study are from a gene expression 
comparison study between D. melanogaster and D. simu- 
lans [46]. Expression of 10 genotypes of each species was 
measured in male flies. In D. simulans, each genotype was 
measured separately, and in D. melanogaster, a pool of 10 
genotypes was measured. All genotypes (individual or 
pooled) were independently isolated and hybridized three 
times. The goal of the original study was to provide a gen- 
ome-wide approach to identifying candidate genes poten- 
tially responsible for adaptation and speciation in D. 
simulans and D. melanogaster. In this study, we focus on 
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FGen.unres FGen.unres 

Figure 2 The comparison of power curves of F Gen from unrestricted residual permutation versus other F-like test statistics. Only results 

from permutation combinations that can control prespecified CWER are used in this figure. The x-axis is the average power after analyzing 900 

simulated data sets using F Gen and 1,000 unrestricted residual permutations. The y-axis is the estimated power from other F-like test statistics and 

empirical gene-specific null distributions based on the appropriate permutation. The solid black line corresponds to F Gen with unrestricted 

permutation, and this test always controls prespecified CWER."ce," all genes have common error; "gh," only between-gene heteroscedasticity 

exists; "wgh," only within-gene heteroscedasticity exists; "bgh," both between-gene and within gene heteroscedasticity exist; "res," restricted 

permutation; "unres," unrestricted permuation. 
I ) 



identifying sequence differences between genotypes in D. 
simulans based on hybridization profiles. Within-gene het- 
eroscedasticity is expected because the genotypes come 
from different lines. The proposed generalized shrinkage 
F-like test statistics F Gew F Gen _ gene , and F Gen _ grp were com- 
pared with F 2 , F 3 with restricted residual permutation, 
which could control prespecified CWER for any variance 
structure in simulation studies. Furthermore, Smyth's 
moderated F-test statistic [25] without multiple testing 



adjustment and controlling the false discovery rate (FDR) 
at 5% were used for comparison. As the main interest is in 
sequence difference, the focus is on the test of interaction 
between line and probe. The split plot model described 
above is used. SAS program codes are included in the 
additional files (additional file 1 and additional file 2). 

The Drosophila genome has been fully sequenced and 
both SNPs and indels can cause a significant interaction 
term. Thus, the false positive rate and detection power 



Table 2 Results from residual permutation 



Restricted? 


Data set 




F 2 


F 3 


Fcui 


FGen 


FGen-gene 


FGen-grp 


YES 


null-ce 


4.59(0.07) 


4.15(0.08) 


3.63(0.14) 


4.09(0.09) 


4.55(0.07) 


3.23(0.06) 


4.44(0.08) 




null-gh 


4.57(0.07) 


4.1(0.13) 


3.95(0.16) 


4.49(0.07) 


4.6(0.07) 


4.61(0.07) 


4.38(0.07) 




null-wgh 


6.74(0.08) 


4.33(0.08) 


3.51(0.14) 


6.49(0.09) 


4.38(0.07) 


4.2(0.07) 


2.78(0.09) 




null-bgh 


6.74(0.08) 


4.35(0.16) 


4.07(0.19) 


6.58(0.08) 


4.36(0.07) 


4.16(0.07) 


3.64(0.07) 


NO 


null-ce 


5.1(0.07) 


4.99(0.08) 


4.5(0.08) 


4.59(0.08) 


4.99(0.07) 


4.1(0.07) 


4.68(0.07) 




null-gh 


5.1(0.07) 


4.83(0.1) 


4.59(0.11) 


5.08(0.07) 


4.99(0.07) 


5.01(0.07) 


4.95(0.07) 




null-wgh 


10.75(0.09) 


8.46(0.09) 


7.6(0.09) 


12.37(0.11) 


5.03(0.08) 


6.43(0.08) 


4.93(0.08) 




null-bgh 


10.75(0.1) 


8.38(0.17) 


8.07(0.19) 


10.79(0.1) 


5.02(0.08) 


6.38(0.08) 


6.73(0.08) 



CWER obtained from 1,000 permutations with the nominal significance level setting at 0.05, with standard errors in parentheses. Nine hundred simulation runs 
were performed to get empirical average CWER of all types of F-like test statistics. 
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based on SNP/indel sequence information can be calcu- 
lated for a subset of the data. In the data set, there were 
10 lines from D. simulans and three replicates from 
each line. Each probe set had 14 probes. The 1,285 pro- 
besets containing all "good" probes were selected. A 
"bad" probe's sequence satisfies one or more of the fol- 
lowing criteria: it matches the D. simulans genome mul- 
tiple times; it cannot be mapped to the flybase 4.2.1 
genome; or, it has no information, such as hitting out- 
side an exon, hitting a poorly aligned region, or hitting a 
region lacking a sequence. SNP or indel information 
could be determined in 777 probesets. For this data set, 
there was a high degree of within-gene heteroscedasti- 
city: about 22.3% of the probe sets had a difference in 
line-specific residual variance estimates as large as or 
more than a 10-fold change. Therefore, as suggested by 
the conclusions from simulation studies, unrestricted 
residual permutation and restricted residual permutation 
were used for generalized shrinkage F-like test statistics 
(Fcem ^Gen-gene* F Gen-gr P ) and restricted residual permuta- 
tion was used for statistics (F 2 , F 3 ). The results are 
shown in Table 3. Consistent with the findings from the 
simulation studies, F Gen had about 30% more detecting 
power by valuing the within-gene heteroscedasticity 
than the other F-like test statistics (F 2 , F 3 ). The false dis- 
covery rate of F Gen was slightly higher than that of F 2 , 

-^"3* ^Gen-gene and F Gen 

-grp performed similarly to 

F Gen* 

Both of Smyth's moderated F-test statistic without mul- 
tiple testing adjustment and with FDR set at 5% for 
multiple testing adjustment detected more SNPs and 
indels but at the expense of a greater FDR than F Gen . 

Discussion 

For gene expression analysis, ANOVA models have been 
a popular modeling technique. Based on ANOVA mod- 
els, flexible shrinkage F-like test statistics were devel- 
oped to account for both the within-gene and between- 
gene heteroscedasticities. The emphasis here is on 



testing an interaction term, as this case is of increasing 
interest to biologists, and there is no clear existing the- 
ory on the most powerful, valid approach for such sta- 
tistics. For all F-like statistics studied here, their null 
distributions were approximated empirically through 
permutations. Four different permutation procedures 
were investigated for eight different F-like statistical 
tests of the interaction term. 

As expected, we found that when an error estimator 
overshrinks, the resulting F-like statistic cannot control 
the prespecified CWER. For example, F Gen _ gene is an 
over-shrinkage error estimator when there is within- 
gene heteroscedasticity. As a result, compared with gen- 
eralized shrinkage F-like statistics, it is not valid when 
within-gene heteroscedasticity exists. Undershrinkage is 
also important, as it will lead to a conservative test and 
lower power. This is clearly demonstrated when the 
common error can be assumed and the most powerful 
valid test is F Gen . grp . 

The most striking result was the impact of the per- 
mutation procedures. Although this was not comple- 
tely unexpected [43-45], the effect of the permutation 
procedures is dramatic and worthy of special attention. 
Unrestricted raw data permutation could not control 
prespeci-fied CWER when there was within-gene het- 
eroscedasticity. Restricted raw data permutation could 
be used, but it was less powerful than residual permu- 
tation. Also consistent with findings from Anderson 
and Ter Braak [44], restricted permutations are less 
powerful than unrestricted permutations. However, 
unrestricted permutations are valid only for a common 
error and when between-gene heteroscedasticity exists 
for our proposed shrinkage statistics; they are not valid 
in combination with F 2 , F 3 , or F Cu r F° r ^Gen-grpi the 
unrestricted permutation can also be used in cases 
having within-gene heteroscedasticity, while only F Gen 
is valid with unrestricted permutation in all cases in 
terms of controlling prespecified CWER. Interestingly, 



Table 3 Probe sets with significant line*probe terms found by F-like test statistics and appropriate residual 
permutation procedures and Smyth's moderated F-test statistic 



Test statistic Restricted permutation? Number of probe sets found True false discovery rate Power 



F 2 


Yes 


124 


22.6% 


12.4% 


F 3 


Yes 


187 


29.4% 


17.1% 


F Gen 


No 


453 


29.5% 


41.1% 


FGen-gene 


No 


455 


28.8% 


41.7% 


FGen-grp 


No 


474 


28.9% 


43.4% 


F Gen 


Yes 


136 


24.3% 


13.3% 


FGen-gene 


Yes 


122 


22.1% 


12.2% 


FGen grp 


Yes 


116 


21.5% 


1 1 .7% 


moderated F - 1 


N/A 


535 


34.1% 


75.5% 


moderated F - 2 


N/A 


813 


34.4% 


68.8% 



The CWER was set to 0.05. Gene-specific cutoff values were obtained from 1,000 permutations, "moderated F-1" and "moderated F-2" represent results from 
using moderated F statistic without any multiple testing adjustment and setting FDR to 5%. 
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the power gain from using the correct shrinkage target 
F Gen _ grp rather than F Gen is far less than that of using 
unrestricted permutation. The result is that F 3 is never 
the most powerful choice when testing an interaction 
term. 

The correct shrinkage target can lead to the most 
powerful test statistic. As one of the reviewers sug- 
gested, a statistical test may be applied to help pick the 
best shrinkage target before obtaining shrinkage error 
estimates. However, this extra testing step may inflate 
the CWER of the test statistic when there is gene het- 
eroscedasticity. For example, when there are both types 
of gene heteroscedasticities, it is possible that the above 
test suggests only within-gene heteroscedasticities exist, 
and F Gen _ grp is shown to inflate the CWER. There is 
minimal penalty to using the shrinkage estimator we 
propose, so we recommend setting the shrinkage target 
in the full space spanned by group and gene and using 
unrestricted permutation to compensate for the possible 
power loss in fewer degrees of freedom left for estimat- 
ing the errors. 

Conclusions 

The proposed generalized shrinkage F-like statistic with 
shrinkage targets located in a space spanned by gene 
and another group, F Gew with unrestricted residual per- 
mutation is always valid in terms of having a prespeci- 
fied CWER. This statistic has reasonable power in most 
cases; thus, it is generally recommended to be applied to 
test an interaction term in the analysis of real gene 
expression data. 

Additional material 



Additional file 1: SAS program code 1. SAS program code for 
analyzing the real data set using residual permutation without restriction. 

Additional file 2: SAS program code 2. SAS program code for 
analyzing the real data set using residual permutation with restriction. 
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